AVOIDING OVER-FITS:

Principal Components, Partial Least Squares, Regularization, and Re-Sampling

Marco R. Steenbergen

University of Zurich

Machine Learning Errors

Error Sources

  1. Irreducible error.
  2. Bias error.
  3. Variance error.

Irreducible Error

  • Error that cannot be eliminated, not even through fine-tuning or choosing a different algorithm.

  • Captured by \(\varepsilon\).

  • Sources include

    • Missing predictive features.

    • Measurement error.

    • Random shocks.

  • The remedy is to collect more and better data.

Bias Error

  • Bias means that we systematically err in our predictions.

  • Common causes include:

    • Stopping training too early—especially relevant for complex algorithms such as neural networks.

    • Under-fitting the data.

  • An under-fit means that we miss elements of the data generating process—our model is not sufficiently complex.

  • Remedies are non-interference with the learning process (no premature stopping) and increased model complexity.

Variance Error

  • The model does not generalize well to new data; small changes in the data have large consequences for performance.

  • The major cause of variance error is over-fitting, which means any of the following:

    • The model is too complex.

    • The model is too flexible.

    • We capitalize on chance.

  • The remedy is to reduce model complexity.

An Example of Under- and Over-Fitting

Source: Seema Singh

Total Error and the Bias-Variance Trade-Off

  • Total error is \(B^2 + V + I\).

  • To reduce \(B\), we may have to increase model complexity, but this could increase \(V\).

  • To reduce \(V\), we may have to reduce model complexity, but this could increase \(B\).

  • The goal is to find the sweet spot.
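The trade-off can be seen in a small base-R simulation; the data and polynomial degrees below are illustrative, with degree 1 under-fitting and degree 15 over-fitting a sine curve:

```r
# Simulate y = sin(x) + noise; compare test error for three model complexities.
set.seed(1)
x <- seq(0, 2 * pi, length.out = 50)
y <- sin(x) + rnorm(50, sd = 0.3)
x_new <- seq(0, 2 * pi, length.out = 200)
y_new <- sin(x_new) + rnorm(200, sd = 0.3)

mse <- function(deg) {
  fit <- lm(y ~ poly(x, deg))
  pred <- predict(fit, newdata = data.frame(x = x_new))
  mean((y_new - pred)^2)
}

# Degrees 1, 3, 15: under-fit, near the sweet spot, over-fit.
test_mse <- sapply(c(1, 3, 15), mse)
test_mse
```

The under-fit model has high bias; the degree-15 model chases noise and pays in variance.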

Reducing the Risk of Over-Fitting

  • We know how to add complexity:

    • Add more features.

    • Add nonlinear effects of and interactions between existing features.

  • But how do we make models simpler?

  • One approach is to combine features:

    • Principal components.

    • Partial least squares.

  • Another approach is to penalize for over-fitting \(\rightarrow\) regularization.

  • Both approaches involve tuning parameters.

Principal Component Analysis

What Is Principal Component Analysis?

  • Principal component analysis (PCA) is a data reduction technique.

  • Consider the placement of German political parties on three issues \(\rightarrow \mathbb{R}^3\).

  • Is there an \(\mathbb{R}^2\) or \(\mathbb{R}^1\) space that adequately accounts for the data?

What Does PCA Do?

Transform a set of correlated features into a smaller set of principal components with the following properties:

  1. They are orthogonal.
  2. They are extracted in the order of the variance for which they account.

Extracting Principal Components

  • Given the \(P \times P\) covariance matrix \(\boldsymbol{S}\), the total variance equals the trace: \(T = \sum_{j=1}^P s_{jj}\), where \(s_{jj}\) is the variance of the \(j\)th feature.

  • We perform the spectral decomposition

    \[ \boldsymbol{S} = \boldsymbol{C} \boldsymbol{L} \boldsymbol{C}^\top = \sum_{j=1}^P \lambda_j \boldsymbol{c}_j \boldsymbol{c}_j^\top \]

  • Here,

    • \(\boldsymbol{L}\) is a diagonal matrix of eigenvalues, whose elements are \(\lambda_j\).

    • \(\boldsymbol{C}\) is a \(P \times P\) orthogonal matrix of eigenvectors or loadings, whose rows correspond to the features and whose columns correspond to the components.
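As a check, the decomposition can be reproduced in base R; the iris measurements serve as stand-in data:

```r
# Spectral decomposition of a covariance matrix: S = C L C^T.
X <- as.matrix(iris[, 1:4])       # any numeric feature matrix
S <- cov(X)
eig <- eigen(S, symmetric = TRUE)
C <- eig$vectors                  # eigenvectors (loadings)
L <- diag(eig$values)             # eigenvalues on the diagonal
max(abs(S - C %*% L %*% t(C)))    # reconstruction error, ~ 0
# The eigenvalues sum to the total variance (the trace of S) ...
sum(eig$values) - sum(diag(S))    # ~ 0
# ... so cumulative variance accounted for is easy to read off:
cumsum(eig$values) / sum(eig$values)
```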

Variance Accounted For and Data Reduction

  • Mathematically, \(\sum_{i=1}^P \lambda_i = T\).

  • Thus, the \(i\)th component accounts for \(100 \cdot \frac{\lambda_i}{T}\) percent of the total variance.

  • We typically retain the \(M < P\) components that account for most of the variance.

  • We treat \(M\) as a tuning parameter.

PCA as a Model

  • Instance \(i\)’s score on the \(m\)th principal component is

    \[ p_{im} = \sum_{p=1}^P c_{pm} x_{ip} \qquad(1)\] where

    • \(i\) indexes an instance

    • \(m\) indexes a component

    • \(p\) indexes the feature

    • \(c_{pm}\) is the loading

  • In matrix terms,

    \[ \boldsymbol{P} = \boldsymbol{X} \boldsymbol{C} \qquad(2)\]
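Equation 2 can be verified against prcomp, whose rotation matrix is \(\boldsymbol{C}\) and whose x element holds the scores \(\boldsymbol{P}\); iris again serves as stand-in data:

```r
# Scores from Equation 2, P = X C, computed by hand and via prcomp.
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
fit <- prcomp(X, center = FALSE)   # already centered above
P_scores <- X %*% fit$rotation     # Equation 2
max(abs(P_scores - fit$x))         # ~ 0
```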

Interpretation

  • We can use the loadings to make sense of the principal components.

  • However, the loadings are not unique.

  • We can account for the same portion of \(T\) through arbitrary orthogonal transformations of \(\boldsymbol{C}\) that we call rotations.

  • Rotation is our friend—a carefully selected rotation can simplify the interpretation.

Illustration

             EU  spending  immigration
CDU        6.38      5.73         5.73
SPD        6.38      3.45         3.91
FDP        5.69      7.36         3.60
Gruenen    6.23      2.82         2.09
Linke      3.00      0.50         4.00
AfD        4.85      6.18         7.45
CSU        1.62      7.89         9.30
pca_fit <- prcomp(german_df,
                  scale = FALSE,
                  center = TRUE,
                  retx = FALSE)
Standard deviations (1, .., p=3):
[1] 3.4172813 2.1019527 0.9245449

Rotation (n x k) = (3 x 3):
                   PC1        PC2        PC3
EU          -0.2924737 -0.6993803  0.6521705
spending     0.6719890 -0.6355302 -0.3801739
immigration  0.6803602  0.3270605  0.6558517

The first component explains 68.9% of the total variance, the second 26.1%, and the third 5.0%.

# Settling on M = 2 components
pca_rotate <- varimax(pca_fit$rotation[,1:2])

Loadings:
            PC1    PC2   
EU           0.268 -0.709
spending     0.925       
immigration  0.270  0.705

                 PC1   PC2
SS loadings    1.000 1.000
Proportion Var 0.333 0.333
Cumulative Var 0.333 0.667

The 1st component is mostly about spending and thus economic in nature. The 2nd component is about open versus closed borders.

# Done here by hand, but can be done more easily using the psych package.
# Step 1: Extract original loadings for PC1 and PC2.
L <- pca_fit$rotation[,1:2]
# Step 2: Extract rotation matrix from varimax rotation.
W <- pca_rotate$rotmat
# Step 3: Re-create rotated loadings.
M <- L %*% W
# Step 4: Center the features column-wise.
X_centered <- scale(as.matrix(german_df), center = TRUE, scale = FALSE)
# Step 5: Generate scores.
scores <- X_centered %*% M
scores
             [,1]       [,2]
CDU      1.374228 -0.6587567
SPD     -1.426580 -1.7698443
FDP      2.102471 -1.2593214
Gruenen -2.338837 -3.1188807
Linke   -5.036796  0.6904769
AfD      1.823888  2.0510158
CSU      3.058649  5.2337300

Selecting the Number of Components

  • We somewhat haphazardly settled on 2 components based on explained variance.

  • In machine learning, we would generally rely on cross-validation to find the optimal number of components from the perspective of predictive performance.

Tuning Parameters and Cross-Validation

Of Tuning Parameters

  • A tuning parameter (a.k.a. hyper-parameter) is a parameter that affects the operation of the algorithm but cannot be estimated from the data.

  • Oftentimes, these tuning parameters concern model complexity.

  • An example is \(M\), the number of principal components to be retained.

  • Although the modeler ultimately sets the value of the tuning parameter, it is possible to let the data speak to that decision \(\rightarrow\) (cross-)validation.

  • Re-sampling is an out-of-sample method to assess whether results will generalize.

Why Engage in Re-Sampling?

  • The alternative is re-substitution—measure performance on the same data used to optimize performance in the first place.

  • The resulting error is known as the re-substitution error.

  • Re-substitution errors over-estimate performance (Efron, 1983).

  • This is why we look for the alternative of re-sampling.

An Overview of Re-Sampling Methods

  • We can further sub-divide the training set using a number of re-sampling approaches:

    1. A further application of the split sample approach
    2. \(k\)-fold and other cross-validation methods
    3. The bootstrap
  • In each case, we generate out-of-sample estimates of performance.

  • A discussion of the pros and cons of the various methods can be found in Molinaro et al. (2005).

The Three-Way Split of the Data

  • Earlier, we saw how to split the data into training and test sets.

  • We can divide the training set again, resulting in two components:

    • The training set proper

    • The validation set

  • The validation set allows us to assess how choices about tuning parameters in the training process generalize before we deploy the tuned model on our test set.

Validation Sets in tidymodels

library(rio)
library(tidymodels)
library(tidyverse)
tidymodels_prefer()
happy_df <- import("Data/whr23.dta")
row.names(happy_df) <- happy_df$country
happy_clean <- happy_df %>%
  select(happiness,logpercapgdp,socialsupport,healthylifeexpectancy,
         freedom,generosity,perceptionsofcorruption) %>%
  na.omit
set.seed(10)
happy_split <- initial_split(happy_clean, prop = 0.6, strata = happiness)
happy_train <- training(happy_split)
happy_test <- testing(happy_split)
val_set <- validation_split(happy_train, prop = 3/4)
val_set
# Validation Set Split (0.75/0.25)  
# A tibble: 1 × 2
  splits          id        
  <list>          <chr>     
1 <split [60/20]> validation

Often We Can Do Better

  • The split sample approach has a couple of disadvantages:

    • We train on a potentially small fraction of the original data, which induces a downward bias in performance.

    • We obtain only one value for the optimal tuning parameter, which limits our sense of its generalizability.

  • If we have a lot of data, these problems are typically not all that severe.

  • In smaller data sets, we should and can do better through cross-validation or bootstrapping.

\(k\)-Fold Cross-Validation

  • Randomly assign instances to \(k\) folds (typically, \(k = 10\)).

  • Each fold is held out once for validation purposes.

  • Each fold is used \(k-1\) times for training purposes.
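A hand-rolled sketch of the fold assignment that vfold_cv automates; the values of n and k are illustrative:

```r
# Assign n instances at random to k folds of (near-)equal size.
set.seed(42)
n <- 80
k <- 10
folds <- sample(rep(1:k, length.out = n))
# Fold i is held out once for validation; the other k - 1 folds train.
val_sizes <- tabulate(folds, nbins = k)
val_sizes   # 8 instances per fold
```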

Variations on a Theme

  • The higher we set \(k\), the more data will be used for training at any given time.

  • An extreme case is leave-one-out CV (LOOCV), where \(k = n_1\), so that each fold contains a single instance.

  • In each run, bias is reduced but variance is increased.

  • Another refinement is to repeatedly assign instances to the folds, each time using a different seed; this reduces the variance of the performance estimate.

  • Monte Carlo CV draws the folds at random each time, so folds from different draws may contain overlapping instances (Xu & Liang, 2001).

Cross-Validation in tidymodels

set.seed(1923)
vanilla_folds <- vfold_cv(happy_train, v = 10)
vanilla_folds
#  10-fold cross-validation 
# A tibble: 10 × 2
   splits         id    
   <list>         <chr> 
 1 <split [72/8]> Fold01
 2 <split [72/8]> Fold02
 3 <split [72/8]> Fold03
 4 <split [72/8]> Fold04
 5 <split [72/8]> Fold05
 6 <split [72/8]> Fold06
 7 <split [72/8]> Fold07
 8 <split [72/8]> Fold08
 9 <split [72/8]> Fold09
10 <split [72/8]> Fold10
set.seed(1923)
repeated_folds <- vfold_cv(happy_train, v = 10, repeats = 5)
repeated_folds
#  10-fold cross-validation repeated 5 times 
# A tibble: 50 × 3
   splits         id      id2   
   <list>         <chr>   <chr> 
 1 <split [72/8]> Repeat1 Fold01
 2 <split [72/8]> Repeat1 Fold02
 3 <split [72/8]> Repeat1 Fold03
 4 <split [72/8]> Repeat1 Fold04
 5 <split [72/8]> Repeat1 Fold05
 6 <split [72/8]> Repeat1 Fold06
 7 <split [72/8]> Repeat1 Fold07
 8 <split [72/8]> Repeat1 Fold08
 9 <split [72/8]> Repeat1 Fold09
10 <split [72/8]> Repeat1 Fold10
# ℹ 40 more rows
set.seed(1923)
loocv_folds <- loo_cv(happy_train)
loocv_folds
# Leave-one-out cross-validation 
# A tibble: 80 × 2
   splits         id        
   <list>         <chr>     
 1 <split [79/1]> Resample1 
 2 <split [79/1]> Resample2 
 3 <split [79/1]> Resample3 
 4 <split [79/1]> Resample4 
 5 <split [79/1]> Resample5 
 6 <split [79/1]> Resample6 
 7 <split [79/1]> Resample7 
 8 <split [79/1]> Resample8 
 9 <split [79/1]> Resample9 
10 <split [79/1]> Resample10
# ℹ 70 more rows
set.seed(1923)
mc_folds <- mc_cv(happy_train, prop = 9/10, times = 20)
mc_folds
# Monte Carlo cross-validation (0.9/0.1) with 20 resamples  
# A tibble: 20 × 2
   splits         id        
   <list>         <chr>     
 1 <split [72/8]> Resample01
 2 <split [72/8]> Resample02
 3 <split [72/8]> Resample03
 4 <split [72/8]> Resample04
 5 <split [72/8]> Resample05
 6 <split [72/8]> Resample06
 7 <split [72/8]> Resample07
 8 <split [72/8]> Resample08
 9 <split [72/8]> Resample09
10 <split [72/8]> Resample10
11 <split [72/8]> Resample11
12 <split [72/8]> Resample12
13 <split [72/8]> Resample13
14 <split [72/8]> Resample14
15 <split [72/8]> Resample15
16 <split [72/8]> Resample16
17 <split [72/8]> Resample17
18 <split [72/8]> Resample18
19 <split [72/8]> Resample19
20 <split [72/8]> Resample20

The Bootstrap

  • The bootstrap consists of repeatedly sampling \(n_1\) instances with replacement (Efron, 1979).

  • Sampled instances are used for training.

  • Non-sampled instances are used for validation.
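On average only about \(1 - (1 - 1/n)^n \approx 63.2\%\) of the instances enter a bootstrap sample, leaving roughly 37% for validation. A quick simulation, with an illustrative n:

```r
# Fraction of instances left out of a bootstrap sample of size n.
set.seed(7)
n <- 80
oob_frac <- replicate(2000, {
  drawn <- sample(n, replace = TRUE)   # one bootstrap sample
  1 - length(unique(drawn)) / n        # out-of-bag fraction
})
mean(oob_frac)   # close to (1 - 1/n)^n, about 0.368
```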

The Bootstrap in tidymodels

set.seed(1923)
booted <- bootstraps(happy_train, times = 100)
booted
# Bootstrap sampling 
# A tibble: 100 × 2
   splits          id          
   <list>          <chr>       
 1 <split [80/29]> Bootstrap001
 2 <split [80/34]> Bootstrap002
 3 <split [80/27]> Bootstrap003
 4 <split [80/25]> Bootstrap004
 5 <split [80/26]> Bootstrap005
 6 <split [80/28]> Bootstrap006
 7 <split [80/27]> Bootstrap007
 8 <split [80/26]> Bootstrap008
 9 <split [80/24]> Bootstrap009
10 <split [80/29]> Bootstrap010
# ℹ 90 more rows

The Price of Bootstrapping and Cross-Validation

  • For each value of a tuning parameter, the model needs to be re-run multiple times.

  • This can be extremely slow for complex models.

  • This is why validation sets, rather than cross-validation or the bootstrap, are often used in deep learning.

Principal Component Regression and Partial Least Squares

Using Principal Components to Avoid an Over-Fit

  • The problem of over-fitting can be stated as: \(P\) weights are too many for the task.

  • We have seen that we can reduce \(P\) predictive features to \(M < P\) principal components.

  • If we use the components as the predictive features, then this should decrease the risk of over-fitting the data.

  • This is known as principal component regression.

Recipe for Principal Component Regression

  1. Select \(M\) through cross-validation.
  2. Use Equation 1 to create scores, \(p_{im}\), on \(M\) components.
  3. Train \(y_i = \beta^* + \sum_{m=1}^M \omega_m^* p_{im} + \varepsilon_i^*\).
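Steps 2 and 3 can be sketched in base R; mtcars serves as stand-in data, and M = 2 is fixed here rather than cross-validated:

```r
# Principal component regression: regress the label on M component scores.
X <- scale(as.matrix(mtcars[, c("disp", "hp", "wt", "drat")]),
           center = TRUE, scale = FALSE)
y <- mtcars$mpg
pca <- prcomp(X, center = FALSE)
scores <- pca$x[, 1:2]        # p_im for M = 2 components (Equation 1)
pcr_fit <- lm(y ~ scores)     # y_i = beta* + sum_m omega*_m p_im + eps*_i
coef(pcr_fit)
```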

A Limitation

  • From a predictive modeling perspective, the problem with PC regression is that it never considers the label when constructing component scores.

  • This means that PC regression is not designed to optimize the prediction of the label.

  • We can change that by not just considering the total variance of the predictive features, but also the covariance of those features with the label.

  • Partial least squares (PLS) allows this (Liu et al., 2022; Wold et al., 2001).

Partial Least Squares

For the \(m\)th component,

\[ \max_{\boldsymbol{\omega}_m} \text{Cor}^2(\boldsymbol{y},\boldsymbol{X}\boldsymbol{\omega}_m) \text{Var}(\boldsymbol{X}\boldsymbol{\omega}_m) \qquad(3)\]

subject to the constraint that \(\boldsymbol{\omega}_m\) has unit length and is orthogonal to the preceding weight vectors.

As in PC regression, \(M\) is set through cross-validation. It is standard practice to standardize the predictive features.
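Because \(\text{Cor}^2(\boldsymbol{y}, \boldsymbol{X}\boldsymbol{\omega})\,\text{Var}(\boldsymbol{X}\boldsymbol{\omega}) = \text{Cov}^2(\boldsymbol{y}, \boldsymbol{X}\boldsymbol{\omega}) / \text{Var}(\boldsymbol{y})\), the first weight vector is proportional to \(\boldsymbol{X}^\top \boldsymbol{y}\). A sketch with standardized stand-in data (mtcars):

```r
# First PLS weight vector: X^T y, normalized to unit length.
X <- scale(as.matrix(mtcars[, c("disp", "hp", "wt", "drat")]))
y <- scale(mtcars$mpg)[, 1]
w1 <- crossprod(X, y)          # proportional to the first PLS direction
w1 <- w1 / sqrt(sum(w1^2))
t1 <- X %*% w1                 # scores on the first component
cor(t1, y)                     # the label is built into the component
```

Unlike a first principal component, this component is constructed with the label in view, which is exactly the point of PLS.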

Strategies for Tuning

  • Grid searches:

    • We pre-define a set of values that should be evaluated.

    • Can be inefficient when the number of parameters is large.

    • Can itself be broken down into regular and irregular grids.

  • Iterative searches:

    • We sequentially obtain new parameter combinations based on prior results.

    • Efficient in terms of number of passes, but each pass may take more time.

    • Can itself be broken down into Bayesian optimization and simulated annealing.

  • For our simple problem, we do a regular grid search; other approaches will be discussed later.

Setting Up for PLS

if (!require("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_bioc("mixOmics")
head(happy_train)
                 happiness logpercapgdp socialsupport healthylifeexpectancy
Bangladesh           4.282        8.685         0.544                64.548
Chad                 4.397        7.261         0.722                53.125
Comoros              3.545        8.075         0.471                59.425
Congo (Kinshasa)     3.207        7.007         0.652                55.375
Egypt                4.170        9.367         0.726                63.503
Gambia               4.279        7.648         0.584                57.900
                 freedom generosity perceptionsofcorruption
Bangladesh         0.845      0.005                   0.698
Chad               0.677      0.221                   0.807
Comoros            0.470     -0.014                   0.727
Congo (Kinshasa)   0.664      0.086                   0.834
Egypt              0.732     -0.183                   0.580
Gambia             0.596      0.364                   0.883
happy_recipe <- recipe(happiness ~ ., data = happy_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pls(all_numeric_predictors(),
           outcome = "happiness",
           num_comp = tune())
set.seed(20)
cv_folds <- vfold_cv(happy_train,
                     v = 10,
                     repeats = 5)
pls_grid <- tibble(num_comp = 1:6) 
happy_model <-
  linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")
happy_metric <- metric_set(rsq)
happy_flow <- workflow() %>%
  add_model(happy_model) %>%
  add_recipe(happy_recipe)

Selecting the Number of Components

set.seed(30)
happy_tune <- happy_flow %>%
  tune_grid(cv_folds,
            grid = pls_grid,
            metrics = happy_metric)
autoplot(happy_tune) +
  theme_light() +
  labs(title = "Parameter Tuning for PLS Model")

happy_best <- select_best(happy_tune)
happy_best
# A tibble: 1 × 2
  num_comp .config             
     <int> <chr>               
1        1 Preprocessor1_Model1

Finalizing the Model

num_comp <- select_best(happy_tune)
final_flow <- happy_flow %>%
  finalize_workflow(num_comp)
final_est <- final_flow %>%
  fit(happy_train)
tidy(final_est)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    5.60     0.0539     104.  2.16e-85
2 PLS1           0.534    0.0306      17.4 1.23e-28
final_recipe <- recipe(happiness ~ ., data = happy_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pls(all_numeric_predictors(), outcome = "happiness", num_comp = 1)
pls_prep <- prep(final_recipe)
tidied_pls <- tidy(pls_prep, 2)
tidied_pls %>%
  group_by(component) %>%
  slice_max(abs(value), n = 5) %>%
  ungroup() %>%
  ggplot(aes(value, fct_reorder(terms,value))) +
  geom_col(show.legend = FALSE,  fill = "#31688EFF") +
  facet_wrap(~ component, scales = "free_y") +
  labs(y = NULL) +
  theme_bw()

Deployment on the Test Set

pls_testfit <- final_est %>%
  predict(happy_test) %>%
  bind_cols(happy_test) %>%
  metrics(truth = happiness, estimate = .pred)
pls_testfit
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.542
2 rsq     standard       0.823
3 mae     standard       0.406

Regularization

What Is Regularization?

  • Regularization adds a penalty for over-fitting to the loss function.

  • This penalty serves as a constraint on the optimization problem.

  • We can also think of this as a shrinkage estimator: coefficients are pulled toward zero, with small coefficients possibly shrunk all the way to zero while larger ones survive.

  • The most important regularization procedures are

    • The lasso.

    • Tikhonov regularization.

    • Elastic nets.

The Lasso

  • Lasso = least absolute shrinkage and selection operator.

  • We impose \(\sum_{j=1}^P \lvert \omega_j \rvert \leq t\).

  • Using Lagrange multipliers, the lasso loss is \(L_{\text{lasso}} = L_2 + \lambda \left( \sum_{j=1}^P \lvert \omega_j \rvert - t \right)\).

  • Retaining only the terms in \(\omega\), the optimization problem is

    \[ \min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} \left( \beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i \right)^2 + \lambda \sum_{j=1}^P \lvert \omega_j \rvert \qquad(4)\]

  • \(\lambda\) is a tuning parameter.
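The effect of the \(L_1\) penalty is captured by the soft-thresholding operator, which solves the lasso exactly when the design is orthonormal; the coefficient values below are illustrative:

```r
# Soft-thresholding: coefficients within lambda of zero are set exactly
# to zero; the remaining ones are shrunk toward zero by lambda.
soft <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)
soft(c(-3, -0.4, 0.2, 2.5), lambda = 0.5)
# [1] -2.5  0.0  0.0  2.0
```

This is why the lasso performs variable selection: small coefficients drop out of the model entirely.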

What the Lasso Does

Variations on a Theme

  • We can think of the lasso as a special case of

    \[ \min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} \left( \beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i \right)^2 + \lambda \sum_{j=1}^P \lvert \omega_j \rvert^q \]

  • For \(q = 1\), this becomes the lasso.

  • For \(q = 2\), this becomes Tikhonov regularization (a.k.a. ridge regression).

Tikhonov Regularization

  • We minimize the \(L_2\) loss subject to \(\sum_{j=1}^P \omega_j^2 \leq t\).

  • Hence,

    \[ \min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} \left( \beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i \right)^2 + \lambda \sum_{j=1}^P \omega_j^2 \qquad(5)\]
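Unlike the lasso, this problem has a closed-form solution, \(\hat{\boldsymbol{\omega}} = (\boldsymbol{X}^\top \boldsymbol{X} + \lambda \boldsymbol{I})^{-1} \boldsymbol{X}^\top \boldsymbol{y}\). A sketch with standardized stand-in data (mtcars) and an illustrative \(\lambda\), showing the shrinkage relative to ordinary least squares:

```r
# Ridge coefficients via the closed form, with a centered label.
X <- scale(as.matrix(mtcars[, c("disp", "hp", "wt", "drat")]))
y <- mtcars$mpg - mean(mtcars$mpg)
lambda <- 10
ridge <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
ols   <- solve(crossprod(X), crossprod(X, y))
sum(ridge^2) < sum(ols^2)   # TRUE: ridge shrinks the coefficient norm
```

Note that, in contrast to the lasso, ridge shrinks coefficients but does not set them exactly to zero.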

Elastic Nets

  • Elastic nets are a mixture of the lasso and Tikhonov regularization.

  • For \(\alpha \in [0,1]\), define the penalty

    \[ P(\alpha) = \sum_{j=1}^P \left[ \frac{1}{2} (1 - \alpha) \omega_j^2 + \alpha \lvert \omega_j \rvert \right] \qquad(6)\]

  • Then

    \[ \min_{\boldsymbol{\omega}} \sum_{i=1}^{n_1} \left( \beta + \boldsymbol{x}_i^\top \boldsymbol{\omega} - y_i \right)^2 + \lambda P(\alpha) \qquad(7)\]

  • Note this reduces to

    • the lasso for \(\alpha = 1\)

    • Tikhonov regularization for \(\alpha = 0\)

Elastic Nets in tidymodels

From PLS, we recycle:

  • The split sample.

  • The cross-validation setup.

  • The standardization.

We use dials for tuning instead of the tibble we used for PLS. We use a regular grid.

model_recipe <- recipe(happiness ~ ., data = happy_train) %>%
  step_normalize(all_numeric_predictors())
elastic_spec <-  linear_reg(penalty = tune(),
                            mixture = tune()) %>%
  set_mode("regression") %>%
  set_engine("glmnet")
elastic_wf <- workflow() %>%
  add_model(elastic_spec) %>%
  add_recipe(model_recipe)
elastic_metrics <- metric_set(rsq)
glmnet_param <- extract_parameter_set_dials(elastic_spec)
elastic_grid <- grid_regular(glmnet_param, levels = 5)
print(elastic_grid, n = 6)
# A tibble: 25 × 2
       penalty mixture
         <dbl>   <dbl>
1 0.0000000001   0.05 
2 0.0000000316   0.05 
3 0.00001        0.05 
4 0.00316        0.05 
5 1              0.05 
6 0.0000000001   0.288
# ℹ 19 more rows
doParallel::registerDoParallel()
set.seed(1561)
elastic_tune <- elastic_wf %>%
  tune_grid(cv_folds, grid = elastic_grid, metrics = elastic_metrics)
elastic_best <- select_best(elastic_tune)
elastic_best
# A tibble: 1 × 3
  penalty mixture .config              
    <dbl>   <dbl> <chr>                
1       1    0.05 Preprocessor1_Model05

elastic_updated <- finalize_model(elastic_spec,
                              elastic_best)
workflow_new <- elastic_wf %>%
  update_model(elastic_updated)
new_fit <- workflow_new %>%
  fit(data = happy_train)
tidy_elastic <- new_fit %>%
  extract_fit_parsnip() %>%
  tidy()
tidy_elastic
# A tibble: 7 × 3
  term                    estimate penalty
  <chr>                      <dbl>   <dbl>
1 (Intercept)               5.60         1
2 logpercapgdp              0.202        1
3 socialsupport             0.252        1
4 healthylifeexpectancy     0.166        1
5 freedom                   0.137        1
6 generosity                0            1
7 perceptionsofcorruption  -0.0818       1

Better Tuning

tidymodels shines in its handling of tuning parameters, allowing for optimized experimental designs. Here, we use a Latin hypercube, a space-filling design that spreads the candidate points across the entire tuning-parameter space.

elastic_grid <- grid_latin_hypercube(glmnet_param,
                                     size = 25,
                                     original = TRUE)
print(elastic_grid, n = 6)
# A tibble: 25 × 2
   penalty mixture
     <dbl>   <dbl>
1 4.79e-10   0.565
2 8.76e- 9   0.670
3 2.08e- 6   0.859
4 1.82e- 9   0.838
5 4.67e- 4   0.117
6 1.28e- 7   0.971
# ℹ 19 more rows
set.seed(101)
elastic_tune <- elastic_wf %>%
  tune_grid(cv_folds, grid = elastic_grid, metrics = elastic_metrics)
elastic_best <- select_best(elastic_tune)
elastic_best
# A tibble: 1 × 3
  penalty mixture .config              
    <dbl>   <dbl> <chr>                
1  0.0613   0.467 Preprocessor1_Model11

Changes for the Lasso and Tikhonov Regularization

# Lasso: fix mixture = 1
elastic_spec <- linear_reg(penalty = tune(),
                           mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")
# Tikhonov regularization (ridge): fix mixture = 0
elastic_spec <- linear_reg(penalty = tune(),
                           mixture = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

References

Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, 7(1), 1–26. http://www.jstor.org/stable/2958830
Efron, B. (1983). Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation. Journal of the American Statistical Association, 78(382), 316–331. DOI: 10.1080/01621459.1983.10477973
Liu, C., Zhang, X., Nguyen, T.T., Liu, J., Wu, T., Lee, E., & Tu, X.M. (2022). Partial Least Squares Regression and Principal Component Analysis: Similarity and Differences Between Two Popular Variable Reduction Approaches. General Psychiatry, 35(1), e100662. DOI: 10.1136/gpsych-2021-100662
Molinaro, A.M., Simon, R., & Pfeiffer, R.M. (2005). Prediction Error Estimation: A Comparison of Resampling Methods. Bioinformatics, 21(15), 3301–3307. DOI: 10.1093/bioinformatics/bti499
Wold, S., Sjöström, M., & Eriksson, L. (2001). PLS-Regression: A Basic Tool of Chemometrics. Chemometrics and Intelligent Laboratory Systems, 58(2), 109–130. DOI: 10.1016/S0169-7439(01)00155-1
Xu, Q.-S., & Liang, Y.-Z. (2001). Monte Carlo Cross Validation. Chemometrics and Intelligent Laboratory Systems, 56(1), 1–11. DOI: 10.1016/S0169-7439(00)00122-2